{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10 Minutes to cuDF and Dask-cuDF\n",
    "=======================\n",
    "\n",
    "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n",
    "\n",
    "### What are these Libraries?\n",
    "\n",
    "[cuDF](https://github.com/rapidsai/cudf) is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data.\n",
    "\n",
    "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple.\n",
    "\n",
    "[Dask-cuDF](https://github.com/rapidsai/dask-cudf) is a library that provides a partitioned, GPU-backed dataframe, using Dask.\n",
    "\n",
    "\n",
    "### When to use cuDF and Dask-cuDF\n",
    "\n",
    "If your workflow is fast enough on a single GPU or your data comfortably fits in memory on a single GPU, you would want to use cuDF. If you want to distribute your workflow across multiple GPUs, have more data than you can fit in memory on a single GPU, or want to analyze data spread across many files at once, you would want to use Dask-cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import cudf\n",
    "import dask_cudf\n",
    "\n",
    "np.random.seed(12)\n",
    "\n",
    "#### Portions of this were borrowed and adapted from the\n",
    "#### cuDF cheatsheet, existing cuDF documentation,\n",
    "#### and 10 Minutes to Pandas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Object Creation\n",
    "---------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.Series` and `dask_cudf.Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series([1,2,3,None,4])\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2) \n",
    "print(ds.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n",
      "4  4  15  4\n",
      "5  5  14  5\n",
      "6  6  13  6\n",
      "7  7  12  7\n",
      "8  8  11  8\n",
      "9  9  10  9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "df = cudf.DataFrame([('a', list(range(20))),\n",
    "('b', list(reversed(range(20)))),\n",
    "('c', list(range(20)))])\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n",
      "4  4  15  4\n",
      "5  5  14  5\n",
      "6  6  13  6\n",
      "7  7  12  7\n",
      "8  8  11  8\n",
      "9  9  10  9\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "ddf = dask_cudf.from_cudf(df, npartitions=2) \n",
    "print(ddf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n",
    "\n",
    "*Note that best practice for using Dask-cuDF is to read data directly into a `dask_cudf.DataFrame` with something like `read_csv` (discussed below).*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a    b\n",
      "0  0  0.1\n",
      "1  1  0.2\n",
      "2  2     \n",
      "3  3  0.3\n"
     ]
    }
   ],
   "source": [
    "pdf = pd.DataFrame({'a': [0, 1, 2, 3],'b': [0.1, 0.2, None, 0.3]})\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a    b\n",
      "0  0  0.1\n",
      "1  1  0.2\n",
      "2  2     \n",
      "3  3  0.3\n"
     ]
    }
   ],
   "source": [
    "dask_df = dask_cudf.from_cudf(pdf, npartitions=2)\n",
    "dask_gdf = dask_cudf.from_dask_dataframe(dask_df)\n",
    "print(dask_gdf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing Data\n",
    "-------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Viewing the top rows of a GPU dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n"
     ]
    }
   ],
   "source": [
    "print(df.head(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n"
     ]
    }
   ],
   "source": [
    "print(ddf.head(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sorting by values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "19  19  0  19\n",
      "18  18  1  18\n",
      "17  17  2  17\n",
      "16  16  3  16\n",
      "15  15  4  15\n",
      "14  14  5  14\n",
      "13  13  6  13\n",
      "12  12  7  12\n",
      "11  11  8  11\n",
      "10  10  9  10\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(df.sort_values(by='b'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "0  19  0  19\n",
      "1  18  1  18\n",
      "2  17  2  17\n",
      "3  16  3  16\n",
      "4  15  4  15\n",
      "5  14  5  14\n",
      "6  13  6  13\n",
      "7  12  7  12\n",
      "8  11  8  11\n",
      "9  10  9  10\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "print(ddf.sort_values(by='b').compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selection\n",
    "------------\n",
    "\n",
    "## Getting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    2\n",
      "3    3\n",
      "4    4\n",
      "5    5\n",
      "6    6\n",
      "7    7\n",
      "8    8\n",
      "9    9\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df['a'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    2\n",
      "3    3\n",
      "4    4\n",
      "5    5\n",
      "6    6\n",
      "7    7\n",
      "8    8\n",
      "9    9\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf['a'].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows from index 2 to index 5 from columns 'a' and 'b'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "2  2  17\n",
      "3  3  16\n",
      "4  4  15\n",
      "5  5  14\n"
     ]
    }
   ],
   "source": [
    "print(df.loc[2:5, ['a', 'b']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "2  2  17\n",
      "3  3  16\n",
      "4  4  15\n",
      "5  5  14\n"
     ]
    }
   ],
   "source": [
    "print(ddf.loc[2:5, ['a', 'b']].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection by Position"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a     0\n",
      "b    19\n",
      "c     0\n",
      "Name: 0, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.iloc[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b\n",
      "0  0  19\n",
      "1  1  18\n",
      "2  2  17\n"
     ]
    }
   ],
   "source": [
    "print(df.iloc[0:3, 0:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also select elements of a `DataFrame` or `Series` with direct index access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "3  3  16  3\n",
      "4  4  15  4\n"
     ]
    }
   ],
   "source": [
    "print(df[3:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3     \n",
      "4    4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s[3:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n"
     ]
    }
   ],
   "source": [
    "print(df[df.b > 15])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c\n",
      "0  0  19  0\n",
      "1  1  18  1\n",
      "2  2  17  2\n",
      "3  3  16  3\n"
     ]
    }
   ],
   "source": [
    "print(ddf[ddf.b > 15].compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "print(df.query(\"b == 3\"))  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "print(ddf.query(\"b == 3\").compute())  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "cudf_comparator = 3\n",
    "print(df.query(\"b == @cudf_comparator\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a  b   c\n",
      "16  16  3  16\n"
     ]
    }
   ],
   "source": [
    "dask_cudf_comparator = 3\n",
    "print(ddf.query(\"b == @val\", local_dict={'val':dask_cudf_comparator}).compute())  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MultiIndex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultiIndex(levels=[array(['a', 'b'], dtype=object) array([1, 2, 3, 4])],\n",
       "codes=   0  1\n",
       "0  0  0\n",
       "1  0  1\n",
       "2  1  2\n",
       "3  1  3)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "arrays = [['a', 'a', 'b', 'b'],\n",
    "          [1, 2, 3, 4]]\n",
    "tuples = list(zip(*arrays))\n",
    "idx = cudf.MultiIndex.from_tuples(tuples)\n",
    "idx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This index can back either axis of a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        first    second\n",
      "a 1  0.154163  0.014575\n",
      "  2  0.740050  0.918747\n",
      "b 3  0.263315  0.900715\n",
      "  4  0.533739  0.033421\n"
     ]
    }
   ],
   "source": [
    "gdf1 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)})\n",
    "gdf1.index = idx\n",
    "print(gdf1.to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               a                   b          \n",
      "               1         2         3         4\n",
      "first   0.956949  0.137209  0.283828  0.606083\n",
      "second  0.944225  0.852736  0.002259  0.521226\n"
     ]
    }
   ],
   "source": [
    "gdf2 = cudf.DataFrame({'first': np.random.rand(4), 'second': np.random.rand(4)}).T\n",
    "gdf2.columns = idx\n",
    "print(gdf2.to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "        first    second\n",
      "b 3  0.263315  0.900715\n"
     ]
    }
   ],
   "source": [
    "print(gdf1.loc[('b', 3)].to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing Data\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing data can be replaced by using the `fillna` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      1\n",
      "1      2\n",
      "2      3\n",
      "3    999\n",
      "4      4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s.fillna(999))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0      1\n",
      "1      2\n",
      "2      3\n",
      "3    999\n",
      "4      4\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ds.fillna(999).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operations\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculating descriptive statistics for a `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.5 1.666666666666666\n"
     ]
    }
   ],
   "source": [
    "print(s.mean(), s.var())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.5 1.6666666666666667\n"
     ]
    }
   ],
   "source": [
    "print(ds.mean().compute(), ds.var().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Applymap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) to apply a function to each partition of the distributed dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    10\n",
      "1    11\n",
      "2    12\n",
      "3    13\n",
      "4    14\n",
      "5    15\n",
      "6    16\n",
      "7    17\n",
      "8    18\n",
      "9    19\n",
      "[10 more rows]\n",
      "Name: a, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "def add_ten(num):\n",
    "    return num + 10\n",
    "\n",
    "print(df['a'].applymap(add_ten))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    10\n",
      "1    11\n",
      "2    12\n",
      "3    13\n",
      "4    14\n",
      "5    15\n",
      "6    16\n",
      "7    17\n",
      "8    18\n",
      "9    19\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf['a'].map_partitions(add_ten).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Histogramming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counting the number of occurrences of each unique value of variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    1\n",
      "2    1\n",
      "3    1\n",
      "4    1\n",
      "5    1\n",
      "6    1\n",
      "7    1\n",
      "8    1\n",
      "9    1\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.a.value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    1\n",
      "2    1\n",
      "3    1\n",
      "4    1\n",
      "5    1\n",
      "6    1\n",
      "7    1\n",
      "8    1\n",
      "9    1\n",
      "[10 more rows]\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ddf.a.value_counts().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0       a\n",
      "1       b\n",
      "2       c\n",
      "3    aaba\n",
      "4    baca\n",
      "5    None\n",
      "6    caba\n",
      "7     dog\n",
      "8     cat\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series(['A', 'B', 'C', 'Aaba', 'Baca', None, 'CABA', 'dog', 'cat'])\n",
    "print(s.str.lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0       a\n",
      "1       b\n",
      "2       c\n",
      "3    aaba\n",
      "4    baca\n",
      "5    None\n",
      "6    caba\n",
      "7     dog\n",
      "8     cat\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "ds = dask_cudf.from_cudf(s, npartitions=2)\n",
    "print(ds.str.lower().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenating `Series` and `DataFrames` row-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "s = cudf.Series([1, 2, 3, None, 5])\n",
    "print(cudf.concat([s, s]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n",
    "print(dask_cudf.concat([ds2, ds2]).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   key  vals_a  vals_b\n",
      "0    a    10.0   100.0\n",
      "1    c    12.0   101.0\n",
      "2    e    14.0   102.0\n",
      "3    b    11.0        \n",
      "4    d    13.0        \n"
     ]
    }
   ],
   "source": [
    "df_a = cudf.DataFrame()\n",
    "df_a['key'] = ['a', 'b', 'c', 'd', 'e']\n",
    "df_a['vals_a'] = [float(i + 10) for i in range(5)]\n",
    "\n",
    "df_b = cudf.DataFrame()\n",
    "df_b['key'] = ['a', 'c', 'e']\n",
    "df_b['vals_b'] = [float(i+100) for i in range(3)]\n",
    "\n",
    "merged = df_a.merge(df_b, on=['key'], how='left')\n",
    "print(merged)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   key  vals_a  vals_b\n",
      "0    a    10.0   100.0\n",
      "1    c    12.0   101.0\n",
      "2    b    11.0        \n",
      "0    e    14.0   102.0\n",
      "1    d    13.0        \n"
     ]
    }
   ],
   "source": [
    "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n",
    "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n",
    "\n",
    "merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n",
    "print(merged)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Append"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Appending values from another `Series` or array-like object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(s.append(s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "0    1\n",
      "1    2\n",
      "2    3\n",
      "3     \n",
      "4    5\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(ds2.append(ds2).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['agg_col1'] = [1 if x % 2 == 0 else 0 for x in range(len(df))]\n",
    "df['agg_col2'] = [1 if x % 3 == 0 else 0 for x in range(len(df))]\n",
    "\n",
    "ddf = dask_cudf.from_cudf(df, npartitions=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping and then applying the `sum` function to the grouped data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c  agg_col2\n",
      "agg_col1\n",
      "0  100   90  100         3\n",
      "1   90  100   90         4\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby('agg_col1').sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a    b    c  agg_col2\n",
      "0  100   90  100         3\n",
      "1   90  100   90         4\n"
     ]
    }
   ],
   "source": [
    "print(ddf.groupby('agg_col1').sum().compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping hierarchically then applying the `sum` function to grouped data. We send the result to a pandas dataframe only for printing purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                    a   b   c\n",
      "agg_col1 agg_col2            \n",
      "0        0         73  60  73\n",
      "         1         27  30  27\n",
      "1        0         54  60  54\n",
      "         1         36  40  36\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby(['agg_col1', 'agg_col2']).sum().to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>agg_col1</th>\n",
       "      <th>agg_col2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">0</th>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>60</td>\n",
       "      <td>73</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>27</td>\n",
       "      <td>30</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">1</th>\n",
       "      <th>0</th>\n",
       "      <td>54</td>\n",
       "      <td>60</td>\n",
       "      <td>54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>36</td>\n",
       "      <td>40</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    a   b   c\n",
       "agg_col1 agg_col2            \n",
       "0        0         73  60  73\n",
       "         1         27  30  27\n",
       "1        0         54  60  54\n",
       "         1         36  40  36"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.groupby(['agg_col1', 'agg_col2']).sum().compute().to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping and applying statistical functions to specific columns, using `agg`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a     b    c\n",
      "agg_col1\n",
      "0  19   9.0  100\n",
      "1  18  10.0   90\n"
     ]
    }
   ],
   "source": [
    "print(df.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    a     b    c\n",
      "0  19   9.0  100\n",
      "1  18  10.0   90\n"
     ]
    }
   ],
   "source": [
    "print(ddf.groupby('agg_col1').agg({'a':'max', 'b':'mean', 'c':'sum'}).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transpose"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a  b\n",
      "0  1  4\n",
      "1  2  5\n",
      "2  3  6\n"
     ]
    }
   ],
   "source": [
    "sample = cudf.DataFrame({'a':[1,2,3], 'b':[4,5,6]})\n",
    "print(sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   0  1  2\n",
      "a  1  2  3\n",
      "b  4  5  6\n"
     ]
    }
   ],
   "source": [
    "print(sample.transpose())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Time Series\n",
    "------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                     date               value\n",
      "0 2018-11-20T00:00:00.000  0.5520376332645666\n",
      "1 2018-11-21T00:00:00.000  0.4853774136627097\n",
      "2 2018-11-22T00:00:00.000  0.7681341540644223\n",
      "3 2018-11-23T00:00:00.000  0.1607167531255701\n"
     ]
    }
   ],
   "source": [
    "import datetime as dt\n",
    "\n",
    "date_df = cudf.DataFrame()\n",
    "date_df['date'] = pd.date_range('11/20/2018', periods=72, freq='D')\n",
    "date_df['value'] = np.random.sample(len(date_df))\n",
    "\n",
    "search_date = dt.datetime.strptime('2018-11-23', '%Y-%m-%d')\n",
    "print(date_df.query('date <= @search_date'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                     date               value\n",
      "0 2018-11-20T00:00:00.000  0.5520376332645666\n",
      "1 2018-11-21T00:00:00.000  0.4853774136627097\n",
      "2 2018-11-22T00:00:00.000  0.7681341540644223\n",
      "3 2018-11-23T00:00:00.000  0.1607167531255701\n"
     ]
    }
   ],
   "source": [
    "date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)\n",
    "print(date_ddf.query('date <= @search_date', local_dict={'search_date':search_date}).compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categoricals\n",
    "------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrames` support categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   id  grade\n",
      "0   1      a\n",
      "1   2      b\n",
      "2   3      b\n",
      "3   4      a\n",
      "4   5      a\n",
      "5   6      e\n"
     ]
    }
   ],
   "source": [
    "pdf = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
    "pdf[\"grade\"] = pdf[\"grade\"].astype(\"category\")\n",
    "\n",
    "gdf = cudf.DataFrame.from_pandas(pdf)\n",
    "print(gdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   id  grade\n",
      "0   1      a\n",
      "1   2      b\n",
      "2   3      b\n",
      "3   4      a\n",
      "4   5      a\n",
      "5   6      e\n"
     ]
    }
   ],
   "source": [
    "dgdf = dask_cudf.from_cudf(gdf, npartitions=2)\n",
    "print(dgdf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('a', 'b', 'e')"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gdf.grade.cat.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accessing the underlying code values of each categorical observation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    1\n",
      "3    0\n",
      "4    0\n",
      "5    2\n",
      "dtype: int8\n"
     ]
    }
   ],
   "source": [
    "print(gdf.grade.cat.codes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    0\n",
      "1    1\n",
      "2    1\n",
      "0    0\n",
      "1    0\n",
      "2    2\n",
      "dtype: int8\n"
     ]
    }
   ],
   "source": [
    "print(dgdf.grade.cat.codes.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting Data Representation\n",
    "--------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n"
     ]
    }
   ],
   "source": [
    "print(df.head().to_pandas())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n"
     ]
    }
   ],
   "source": [
    "print(ddf.compute().head().to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0 19  0  1  1]\n",
      " [ 1 18  1  0  0]\n",
      " [ 2 17  2  1  0]\n",
      " [ 3 16  3  0  1]\n",
      " [ 4 15  4  1  0]\n",
      " [ 5 14  5  0  0]\n",
      " [ 6 13  6  1  1]\n",
      " [ 7 12  7  0  0]\n",
      " [ 8 11  8  1  0]\n",
      " [ 9 10  9  0  1]\n",
      " [10  9 10  1  0]\n",
      " [11  8 11  0  0]\n",
      " [12  7 12  1  1]\n",
      " [13  6 13  0  0]\n",
      " [14  5 14  1  0]\n",
      " [15  4 15  0  1]\n",
      " [16  3 16  1  0]\n",
      " [17  2 17  0  0]\n",
      " [18  1 18  1  1]\n",
      " [19  0 19  0  0]]\n"
     ]
    }
   ],
   "source": [
    "print(df.as_matrix())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0 19  0  1  1]\n",
      " [ 1 18  1  0  0]\n",
      " [ 2 17  2  1  0]\n",
      " [ 3 16  3  0  1]\n",
      " [ 4 15  4  1  0]\n",
      " [ 5 14  5  0  0]\n",
      " [ 6 13  6  1  1]\n",
      " [ 7 12  7  0  0]\n",
      " [ 8 11  8  1  0]\n",
      " [ 9 10  9  0  1]\n",
      " [10  9 10  1  0]\n",
      " [11  8 11  0  0]\n",
      " [12  7 12  1  1]\n",
      " [13  6 13  0  0]\n",
      " [14  5 14  1  0]\n",
      " [15  4 15  0  1]\n",
      " [16  3 16  1  0]\n",
      " [17  2 17  0  0]\n",
      " [18  1 18  1  1]\n",
      " [19  0 19  0  0]]\n"
     ]
    }
   ],
   "source": [
    "print(ddf.compute().as_matrix())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,\n",
       "       17, 18, 19])"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(df['a'].to_array())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]\n"
     ]
    }
   ],
   "source": [
    "print(ddf['a'].compute().to_array())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Arrow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pyarrow.Table\n",
      "a: int64\n",
      "b: int64\n",
      "c: int64\n",
      "agg_col1: int64\n",
      "agg_col2: int64\n",
      "__index_level_0__: int64\n",
      "metadata\n",
      "--------\n",
      "OrderedDict([(b'pandas',\n",
      "              b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n",
      "              b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n",
      "              b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n",
      "              b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n",
      "              b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n",
      "              b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n",
      "              b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n",
      "              b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n",
      "              b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n",
      "              b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n",
      "              b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n",
      "              b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n",
      "              b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n",
      "              b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n",
      "              b's_version\": \"0.23.4\"}')])\n"
     ]
    }
   ],
   "source": [
    "print(df.to_arrow())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pyarrow.Table\n",
      "a: int64\n",
      "b: int64\n",
      "c: int64\n",
      "agg_col1: int64\n",
      "agg_col2: int64\n",
      "__index_level_0__: int64\n",
      "metadata\n",
      "--------\n",
      "OrderedDict([(b'pandas',\n",
      "              b'{\"index_columns\": [\"__index_level_0__\"], \"column_indexes\": ['\n",
      "              b'{\"name\": null, \"field_name\": null, \"pandas_type\": \"unicode\",'\n",
      "              b' \"numpy_type\": \"object\", \"metadata\": {\"encoding\": \"UTF-8\"}}]'\n",
      "              b', \"columns\": [{\"name\": \"a\", \"field_name\": \"a\", \"pandas_type\"'\n",
      "              b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"name\"'\n",
      "              b': \"b\", \"field_name\": \"b\", \"pandas_type\": \"int64\", \"numpy_typ'\n",
      "              b'e\": \"int64\", \"metadata\": null}, {\"name\": \"c\", \"field_name\": '\n",
      "              b'\"c\", \"pandas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadat'\n",
      "              b'a\": null}, {\"name\": \"agg_col1\", \"field_name\": \"agg_col1\", \"p'\n",
      "              b'andas_type\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": nul'\n",
      "              b'l}, {\"name\": \"agg_col2\", \"field_name\": \"agg_col2\", \"pandas_t'\n",
      "              b'ype\": \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}, {\"n'\n",
      "              b'ame\": null, \"field_name\": \"__index_level_0__\", \"pandas_type\"'\n",
      "              b': \"int64\", \"numpy_type\": \"int64\", \"metadata\": null}], \"panda'\n",
      "              b's_version\": \"0.23.4\"}')])\n"
     ]
    }
   ],
   "source": [
    "print(ddf.compute().to_arrow())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting Data In/Out\n",
    "------------------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to a CSV file, by first sending data to a pandas `Dataframe` on the host."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists('example_output'):\n",
    "    os.mkdir('example_output')\n",
    "    \n",
    "df.to_pandas().to_csv('example_output/foo.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ddf.compute().to_pandas().to_csv('example_output/foo_dask.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading from a csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n",
      "5  5  14  5         0         0\n",
      "6  6  13  6         1         1\n",
      "7  7  12  7         0         0\n",
      "8  8  11  8         1         0\n",
      "9  9  10  9         0         1\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "df = cudf.read_csv('example_output/foo.csv')\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n",
      "5  5  14  5         0         0\n",
      "6  6  13  6         1         1\n",
      "7  7  12  7         0         0\n",
      "8  8  11  8         1         0\n",
      "9  9  10  9         0         1\n",
      "[10 more rows]\n"
     ]
    }
   ],
   "source": [
    "ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n",
    "print(ddf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   a   b  c  agg_col1  agg_col2\n",
      "0  0  19  0         1         1\n",
      "1  1  18  1         0         0\n",
      "2  2  17  2         1         0\n",
      "3  3  16  3         0         1\n",
      "4  4  15  4         1         0\n",
      "5  5  14  5         0         0\n",
      "6  6  13  6         1         1\n",
      "7  7  12  7         0         0\n",
      "8  8  11  8         1         0\n",
      "9  9  10  9         0         1\n",
      "[30 more rows]\n"
     ]
    }
   ],
   "source": [
    "ddf = dask_cudf.read_csv('example_output/*.csv')\n",
    "print(ddf.compute())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parquet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to parquet files, using the CPU via PyArrow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/parquet.py:56: UserWarning: Using CPU via PyArrow to write Parquet dataset, this will be GPU accelerated in the future\n",
      "  warnings.warn(\"Using CPU via PyArrow to write Parquet dataset, this will \"\n"
     ]
    }
   ],
   "source": [
    "df.to_parquet('example_output/temp_parquet')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading parquet files with a GPU-accelerated parquet reader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     a   b   c  agg_col1  agg_col2\n",
      "0    0  19   0         1         1\n",
      "1    1  18   1         0         0\n",
      "2    2  17   2         1         0\n",
      "3    3  16   3         0         1\n",
      "4    4  15   4         1         0\n",
      "5    5  14   5         0         0\n",
      "6    6  13   6         1         1\n",
      "7    7  12   7         0         0\n",
      "8    8  11   8         1         0\n",
      "9    9  10   9         0         1\n",
      "10  10   9  10         1         0\n",
      "11  11   8  11         0         0\n",
      "12  12   7  12         1         1\n",
      "13  13   6  13         0         0\n",
      "14  14   5  14         1         0\n",
      "15  15   4  15         0         1\n",
      "16  16   3  16         1         0\n",
      "17  17   2  17         0         0\n",
      "18  18   1  18         1         1\n",
      "19  19   0  19         0         0\n"
     ]
    }
   ],
   "source": [
    "df = cudf.read_parquet('example_output/temp_parquet/72706b163a0d4feb949005d22146ad83.parquet')\n",
    "print(df.to_pandas())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "ddf.to_parquet('example_files')  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ORC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading ORC files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>boolean1</th>\n",
       "      <th>byte1</th>\n",
       "      <th>short1</th>\n",
       "      <th>int1</th>\n",
       "      <th>long1</th>\n",
       "      <th>float1</th>\n",
       "      <th>double1</th>\n",
       "      <th>bytes1</th>\n",
       "      <th>string1</th>\n",
       "      <th>middle.list.int1</th>\n",
       "      <th>middle.list.string1</th>\n",
       "      <th>list.int1</th>\n",
       "      <th>list.string1</th>\n",
       "      <th>map</th>\n",
       "      <th>map.int1</th>\n",
       "      <th>map.string1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "      <td>1024</td>\n",
       "      <td>65536</td>\n",
       "      <td>9223372036854775807</td>\n",
       "      <td>1.0</td>\n",
       "      <td>-15.0</td>\n",
       "      <td>\u0000\u0001\u0002\u0003\u0004</td>\n",
       "      <td>hi</td>\n",
       "      <td>3</td>\n",
       "      <td>bye</td>\n",
       "      <td>4</td>\n",
       "      <td></td>\n",
       "      <td>chani</td>\n",
       "      <td>5</td>\n",
       "      <td>chani</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>True</td>\n",
       "      <td>100</td>\n",
       "      <td>2048</td>\n",
       "      <td>65536</td>\n",
       "      <td>9223372036854775807</td>\n",
       "      <td>2.0</td>\n",
       "      <td>-5.0</td>\n",
       "      <td></td>\n",
       "      <td>bye</td>\n",
       "      <td>0</td>\n",
       "      <td>bye</td>\n",
       "      <td>0</td>\n",
       "      <td></td>\n",
       "      <td>mauddib</td>\n",
       "      <td>1</td>\n",
       "      <td>mauddib</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   boolean1  byte1  short1   int1                long1  float1  double1  \\\n",
       "0     False      1    1024  65536  9223372036854775807     1.0    -15.0   \n",
       "1      True    100    2048  65536  9223372036854775807     2.0     -5.0   \n",
       "\n",
       "  bytes1 string1  middle.list.int1 middle.list.string1  list.int1  \\\n",
       "0  \u0000\u0001\u0002\u0003\u0004      hi                 3                 bye          4   \n",
       "1            bye                 0                 bye          0   \n",
       "\n",
       "  list.string1      map  map.int1 map.string1  \n",
       "0                 chani         5       chani  \n",
       "1               mauddib         1     mauddib  "
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = cudf.read_orc('/cudf/python/cudf/tests/data/orc/TestOrcFile.test1.orc')\n",
    "df2.to_pandas()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
